@codeflash-ai codeflash-ai bot commented Oct 22, 2025

📄 34% (0.34x) speedup for ChromaLangchainEmbeddingFunction.get_config in chromadb/utils/embedding_functions/chroma_langchain_embedding_function.py

⏱️ Runtime : 60.6 microseconds → 45.1 microseconds (best of 44 runs)

📝 Explanation and details

The optimization replaces the repeated dictionary construction in get_config() with a pre-computed dictionary stored during initialization.

Key changes:

  • Pre-computation: The configuration dictionary is built once in __init__ and stored in self._config
  • Direct return: get_config() now simply returns the cached dictionary instead of rebuilding it each time
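The two changes above can be sketched as follows. This is a minimal illustration, not the actual chromadb source: RebuildingWrapper, CachingWrapper and DummyEmbeddings are hypothetical stand-ins, and the single "embedding_function_class" key is taken from the generated tests; the real configuration may contain more fields.

```python
from typing import Any, Dict


class RebuildingWrapper:
    """Hypothetical stand-in for the original behavior: build the dict on every call."""

    def __init__(self, embedding_function: Any) -> None:
        self.embedding_function = embedding_function

    def get_config(self) -> Dict[str, Any]:
        # A new dict is allocated each time get_config() is called.
        return {"embedding_function_class": type(self.embedding_function).__name__}


class CachingWrapper:
    """Hypothetical stand-in for the optimized behavior: build once, return the cache."""

    def __init__(self, embedding_function: Any) -> None:
        self.embedding_function = embedding_function
        # Built once per instance, at construction time.
        self._config: Dict[str, Any] = {
            "embedding_function_class": type(embedding_function).__name__
        }

    def get_config(self) -> Dict[str, Any]:
        # Only an attribute lookup; no allocation.
        return self._config


class DummyEmbeddings:
    """Placeholder embedding class used only for this sketch."""
```

Both wrappers return equal dictionaries; the difference is that the caching variant hands back the same object on every call.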

Why this is faster:
Dictionary construction in Python involves memory allocation and key-value pair creation overhead. By moving this work to initialization time (which happens once per instance), we eliminate this overhead from the frequently-called get_config() method. The line profiler shows the original version spent significant time on dictionary creation (218μs total), while the optimized version only needs a simple attribute access (71μs total).
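One way to check this claim locally is a small timeit comparison. The sketch below uses hypothetical Rebuilding and Cached classes rather than the real ChromaLangchainEmbeddingFunction, so absolute numbers will differ from the profiler figures quoted above and will vary by machine and Python version.

```python
import timeit


class Rebuilding:
    """Hypothetical class that rebuilds its config dict on every call."""

    def __init__(self, name: str) -> None:
        self.name = name

    def get_config(self) -> dict:
        return {"embedding_function_class": self.name}


class Cached:
    """Hypothetical class that builds its config dict once in __init__."""

    def __init__(self, name: str) -> None:
        self._config = {"embedding_function_class": name}

    def get_config(self) -> dict:
        return self._config


def time_get_config(obj, number: int = 100_000) -> float:
    """Return the total seconds spent in `number` calls to obj.get_config()."""
    return timeit.timeit(obj.get_config, number=number)


if __name__ == "__main__":
    print("rebuild:", time_get_config(Rebuilding("DummyEmbeddings")))
    print("cached: ", time_get_config(Cached("DummyEmbeddings")))
```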

Performance characteristics:
The optimization provides speedups of roughly 20-46% across all test scenarios (19.6% to 46.0% in the generated tests), with particularly strong gains when:

  • get_config() is called repeatedly on the same instance (39.2% faster with 500 calls)
  • Many instances are created from the same class (34.5% faster across 300 instances)
  • Large class names are involved (38.4% faster with 500-character names)

This is a classic space-time tradeoff that favors performance when get_config() is called multiple times per instance, which appears to be the common usage pattern based on the test cases.
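The tradeoff has a behavioral side that reviewers may want to weigh. Because the cached dictionary itself is returned, callers share one mutable object, and the class name is captured once at construction. The sketch below illustrates both effects with a hypothetical CachingWrapper; get_config_copy is an assumed defensive variant for comparison, not something this PR adds.

```python
from typing import Any, Dict


class CachingWrapper:
    """Hypothetical wrapper that caches its config dict at construction time."""

    def __init__(self, embedding_function: Any) -> None:
        self._config: Dict[str, Any] = {
            "embedding_function_class": type(embedding_function).__name__
        }

    def get_config(self) -> Dict[str, Any]:
        # Returns the cached object itself, not a copy.
        return self._config

    def get_config_copy(self) -> Dict[str, Any]:
        # Assumed defensive variant: a shallow copy protects the cache
        # but allocates a new dict on every call.
        return dict(self._config)


class DummyEmbeddings:
    """Placeholder embedding class used only for this sketch."""


wrapper = CachingWrapper(DummyEmbeddings())

# 1. Mutating the returned dict changes what later callers see.
wrapper.get_config()["extra"] = "added by a caller"
shared_keys = sorted(wrapper.get_config())

# 2. Mutating a copy leaves the cache untouched.
wrapper.get_config_copy()["other"] = "ignored"
keys_after_copy = sorted(wrapper.get_config())

# 3. The class name is captured once; renaming the class later is not reflected.
DummyEmbeddings.__name__ = "PatchedEmbeddings"
cached_name = wrapper.get_config()["embedding_function_class"]
```

If callers are expected to treat the result as read-only, returning the cached dict directly is fine; otherwise the copy variant gives up part of the speedup in exchange for isolation.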

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 721 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import sys
import types

import pytest  # used for our unit tests
# ChromaLangchainEmbeddingFunction is imported below, after langchain_core is stubbed.


# Minimal stub for langchain_core.embeddings.Embeddings for testing
class DummyEmbeddings:
    def __init__(self, config_val=None):
        self.config_val = config_val

    def embed_documents(self, docs):
        # Just return a list of lists of floats, one per doc
        return [[float(i)] * 3 for i, _ in enumerate(docs)]

    def embed_image(self, images):
        # Just return a list of lists of floats, one per image
        return [[float(i)] * 5 for i, _ in enumerate(images)]

langchain_core = types.ModuleType("langchain_core")
embeddings_mod = types.ModuleType("langchain_core.embeddings")
embeddings_mod.Embeddings = DummyEmbeddings
langchain_core.embeddings = embeddings_mod
sys.modules["langchain_core"] = langchain_core
sys.modules["langchain_core.embeddings"] = embeddings_mod
from chromadb.utils.embedding_functions.chroma_langchain_embedding_function import \
    ChromaLangchainEmbeddingFunction

# unit tests

# ----------- Basic Test Cases -----------

def test_get_config_returns_expected_keys():
    """Test that get_config returns the expected keys and values for a basic embedding function."""
    emb = DummyEmbeddings()
    clf = ChromaLangchainEmbeddingFunction(emb)
    codeflash_output = clf.get_config(); config = codeflash_output # 612ns -> 503ns (21.7% faster)

def test_get_config_with_different_class_names():
    """Test that get_config reflects the actual class name."""
    class CustomEmbeddings(DummyEmbeddings):
        pass
    emb = CustomEmbeddings()
    clf = ChromaLangchainEmbeddingFunction(emb)
    codeflash_output = clf.get_config(); config = codeflash_output # 489ns -> 340ns (43.8% faster)


def test_get_config_with_subclass_instance():
    """Test get_config with a subclass of DummyEmbeddings."""
    class SubEmbeddings(DummyEmbeddings):
        pass
    emb = SubEmbeddings()
    clf = ChromaLangchainEmbeddingFunction(emb)
    codeflash_output = clf.get_config(); config = codeflash_output # 435ns -> 351ns (23.9% faster)

def test_get_config_after_modifying_embedding_function():
    """Test that get_config reflects the class at construction, not after monkey-patching."""
    emb = DummyEmbeddings()
    clf = ChromaLangchainEmbeddingFunction(emb)
    emb.__class__.__name__ = "PatchedEmbeddings"
    codeflash_output = clf.get_config(); config = codeflash_output # 489ns -> 335ns (46.0% faster)




def test_get_config_with_non_embedding_instance():
    """Test that __init__ raises ValueError if not passed a langchain_core.embeddings.Embeddings instance."""
    class NotEmbeddings:
        pass
    with pytest.raises(ValueError):
        ChromaLangchainEmbeddingFunction(NotEmbeddings())

def test_get_config_with_missing_langchain_core(monkeypatch):
    """Test that __init__ raises ValueError if langchain_core is not installed."""
    monkeypatch.setitem(sys.modules, "langchain_core", None)
    monkeypatch.setitem(sys.modules, "langchain_core.embeddings", None)
    class Dummy:
        pass
    with pytest.raises(ValueError):
        ChromaLangchainEmbeddingFunction(Dummy())

# ----------- Large Scale Test Cases -----------

def test_get_config_many_instances_unique_class_names():
    """Test get_config with many different embedding function classes."""
    num_classes = 100
    clfs = []
    for i in range(num_classes):
        # Dynamically create a class with a unique name
        cls = type(f"Embeddings_{i}", (DummyEmbeddings,), {})
        emb = cls()
        clf = ChromaLangchainEmbeddingFunction(emb)
        clfs.append(clf)
    # All configs should have unique class names
    class_names = set(clf.get_config()["embedding_function_class"] for clf in clfs) # 623ns -> 521ns (19.6% faster)

def test_get_config_performance_large_number_of_calls():
    """Test get_config performance and correctness with many calls."""
    emb = DummyEmbeddings()
    clf = ChromaLangchainEmbeddingFunction(emb)
    configs = [clf.get_config() for _ in range(500)] # 547ns -> 393ns (39.2% faster)
    for config in configs:
        pass

def test_get_config_with_large_class_name():
    """Test get_config with a very large class name."""
    large_name = "X" * 500
    cls = type(large_name, (DummyEmbeddings,), {})
    emb = cls()
    clf = ChromaLangchainEmbeddingFunction(emb)
    codeflash_output = clf.get_config(); config = codeflash_output # 487ns -> 352ns (38.4% faster)

def test_get_config_with_many_different_embedding_functions():
    """Test get_config with many different embedding function instances."""
    embs = [DummyEmbeddings(i) for i in range(300)]
    clfs = [ChromaLangchainEmbeddingFunction(emb) for emb in embs]
    for clf in clfs:
        codeflash_output = clf.get_config(); config = codeflash_output # 56.2μs -> 41.8μs (34.5% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import sys
import types
from typing import Sequence

import pytest
# ChromaLangchainEmbeddingFunction is imported below, after langchain_core is stubbed.


# Minimal stubs to allow the test to run without actual langchain_core/chromadb
class EmbeddingFunction:
    pass

Documents = Sequence[str]
Images = Sequence[str]
Embeddings = Sequence[Sequence[float]]

# Minimal stub for langchain_core.embeddings.Embeddings
class DummyLangchainEmbeddings:
    def __init__(self):
        pass
    def embed_documents(self, docs):
        return [[1.0]*3 for _ in docs]
    def embed_image(self, images):
        return [[2.0]*3 for _ in images]


langchain_core_module = types.ModuleType("langchain_core")
embeddings_module = types.ModuleType("langchain_core.embeddings")
embeddings_module.Embeddings = DummyLangchainEmbeddings
langchain_core_module.embeddings = embeddings_module
sys.modules["langchain_core"] = langchain_core_module
sys.modules["langchain_core.embeddings"] = embeddings_module
from chromadb.utils.embedding_functions.chroma_langchain_embedding_function import \
    ChromaLangchainEmbeddingFunction

# ------------------ UNIT TESTS BELOW ------------------

# 1. Basic Test Cases

def test_init_raises_on_wrong_type():
    """Test that __init__ raises ValueError if embedding_function is not a langchain_core Embeddings."""
    class NotLangchain:
        pass
    with pytest.raises(ValueError) as e:
        ChromaLangchainEmbeddingFunction(embedding_function=NotLangchain())

def test_init_raises_on_missing_langchain_core(monkeypatch):
    """Test that __init__ raises ValueError if langchain_core is not installed."""
    # Remove langchain_core from sys.modules temporarily
    monkeypatch.setitem(sys.modules, "langchain_core", None)
    monkeypatch.setitem(sys.modules, "langchain_core.embeddings", None)
    class Dummy:
        pass
    with pytest.raises(ValueError) as e:
        ChromaLangchainEmbeddingFunction(embedding_function=Dummy())
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from chromadb.utils.embedding_functions.chroma_langchain_embedding_function import ChromaLangchainEmbeddingFunction

To edit these changes, run git checkout codeflash/optimize-ChromaLangchainEmbeddingFunction.get_config-mh2k6290 and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 22, 2025 22:22
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 22, 2025